Good and Bad Data Visualizations

Nii Amoo

2022-03-07

Why visualize data?

Good visualizations can:

  • Give powerful summaries of the underlying data

  • Communicate insights, often to audiences who do not have the luxury of spending as much time with the data as you do.

As a data analyst or data scientist, it’s your responsibility to provide the necessary high-level summaries or takeaways in any data visual you create.

Some Features of Good Visualizations

  • Clear on what they’re communicating

  • Well-defined axes, with the right scaling and labels

  • Good choice of colors and annotations (visually appealing)

  • Less is more

Some Features of Bad Visualizations

  • Cluttered, too much going on in the chart with no clear communication goal

  • Truncated axes that start at non-zero values, which distorts interpretation

  • Poor choice of colors

  • Unnecessary 3D effects
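
To see why a truncated axis distorts interpretation, here is a minimal sketch in base R. The two values (95 and 100) and the axis start of 90 are hypothetical, chosen only for illustration; they are not taken from the Netflix data. The sketch compares the true ratio of the values with the ratio of the bar heights a reader would actually see.

```r
# Hypothetical values, for illustration only (not from the Netflix data)
values <- c(a = 95, b = 100)

# True ratio between the two values
true_ratio <- values[["b"]] / values[["a"]]

# Bar heights as drawn when the axis starts at 90 instead of 0 (assumed cutoff)
axis_start <- 90
drawn_heights <- values - axis_start
drawn_ratio <- drawn_heights[["b"]] / drawn_heights[["a"]]

round(true_ratio, 2)  # 1.05
drawn_ratio           # 2
```

In this toy case the second bar is drawn twice as tall as the first, even though the underlying values differ by only about 5%.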

Our data for today - Netflix Movies & TV Shows

library(tidyverse) # meta-package for data analysis in R
library(plotly) # creating interactive visualizations
library(DT) # nice table formatting


netflix <- read_csv("Data/netflix_titles.csv/netflix_titles.csv")
# head(netflix,5) %>% kbl() %>%
#   kable_styling()

# Get a high-level summary of the data
summary(netflix)
##    show_id              type              title             director        
##  Length:8807        Length:8807        Length:8807        Length:8807       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      cast             country           date_added         release_year 
##  Length:8807        Length:8807        Length:8807        Min.   :1925  
##  Class :character   Class :character   Class :character   1st Qu.:2013  
##  Mode  :character   Mode  :character   Mode  :character   Median :2017  
##                                                           Mean   :2014  
##                                                           3rd Qu.:2019  
##                                                           Max.   :2021  
##     rating            duration          listed_in         description       
##  Length:8807        Length:8807        Length:8807        Length:8807       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 
# For a more detailed summary of the data
skimr::skim(netflix)
Data summary
Name                     netflix
Number of rows           8807
Number of columns        12
_______________________
Column type frequency:
  character              11
  numeric                1
________________________
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
show_id                0           1.00    2    5      0      8807           0
type                   0           1.00    5    7      0         2           0
title                  0           1.00    1  104      0      8807           0
director            2634           0.70    2  208      0      4528           0
cast                 825           0.91    3  771      0      7692           0
country              831           0.91    4  123      0       748           0
date_added            10           1.00   11   18      0      1714           0
rating                 4           1.00    1    8      0        17           0
duration               3           1.00    5   10      0       220           0
listed_in              0           1.00    6   79      0       514           0
description            0           1.00   61  248      0      8775           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean    sd    p0   p25   p50   p75  p100  hist
release_year           0              1  2014.18  8.82  1925  2013  2017  2019  2021  ▁▁▁▁▇

Some Bad Visualizations

Example 1

# A pie chart of ratings 
pie_data <-  netflix %>% 
             filter(type == 'Movie') %>% 
             group_by(rating) %>%
             summarize(count = n())

pie_data %>% datatable()
## There's no simple function to create a pie chart in ggplot2, for a reason.
# pie_data %>% 
#     ggplot(aes(x = "", y = count, fill = rating)) +
#     geom_bar(stat = "identity", width = 1) +
#     coord_polar("y", start = 0) +
#     theme_void()


# Set width and height in plot_ly(); specifying them in layout() is deprecated
pie <- plot_ly(pie_data, labels = ~rating, values = ~count, type = 'pie',
               width = 700, height = 500)


pie <- pie %>% 
       layout(title = 'Top Netflix Movie ratings',
             xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
             yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
pie

What’s wrong with that plot?

The visualization is bad because:

  • It’s vague: putting all the movie ratings together does not help the audience identify what you’re trying to communicate.

  • There are too many rating categories. Remember, good visuals give high-level summaries (less is more).

  • The pie chart used here is not the best tool for comparing multiple categories.

  • Pie charts also make it difficult for your audience to judge the relative sizes of the slices.
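
One possible way to cut down the number of categories is to keep the largest few and fold the rest into a single "Other" group. The sketch below uses dplyr on a small hypothetical table; the ratings "A" to "E", their counts, and the cutoff `top_n_keep` are all made up for illustration and are not the Netflix figures.

```r
library(dplyr)

# Hypothetical rating counts, for illustration only (not the Netflix data)
toy_counts <- tibble(
  rating = c("A", "B", "C", "D", "E"),
  count  = c(50, 30, 10, 6, 4)
)

top_n_keep <- 2  # assumed cutoff: how many categories to show individually

lumped <- toy_counts %>%
  # Keep the name of the largest categories, relabel everything else as "Other"
  mutate(rating = if_else(rank(-count, ties.method = "first") <= top_n_keep,
                          rating, "Other")) %>%
  group_by(rating) %>%
  summarize(count = sum(count), .groups = "drop") %>%
  arrange(desc(count))

lumped
```

Here the three smallest categories collapse into one "Other" row with a count of 20, so the chart has three groups to compare instead of five.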

Let’s look at another.

Example 2

# Movie Ratings over time
rating_data <- netflix %>% 
  select(type, release_year, rating) %>%
  group_by(release_year, rating, type) %>% 
  summarise(frequency = n(), .groups = "drop") # drop grouping; avoids the regrouping message


# Initial line plot
rating_plot_1 <- rating_data %>%
    ggplot(aes(x = release_year,y = frequency,group = rating, color = rating)) +
    geom_line(linewidth = 1.5) # linewidth replaces size for lines in ggplot2 >= 3.4.0

rating_plot_1 %>% ggplotly(width = 800, height = 500)

Examples of Good Visualizations

Example 1

# A bar chart of ratings
Movie_bar_data <- netflix %>% filter(type == 'Movie') %>% group_by(rating) %>%
    summarize(count = n()) %>% arrange(desc(count)) %>% slice_head(n = 5)

bar <- Movie_bar_data %>%
    ggplot(aes(x = reorder(rating, -count, sum), y = count)) +
    geom_col(fill = c('#f9007a','#D6DBDF','#D6DBDF','#D6DBDF','#D6DBDF')) +
    xlab('Ratings') + ylab('Frequency') +
    ggtitle('Top 5 Movie Ratings on Netflix') +
    theme_minimal() +
    theme(
        legend.position = "none",
        plot.title = element_text(size = 18,face = "bold"),
        axis.title = element_text(size = 14)
    )

bar # %>% ggplotly(width = 800, height = 500,tooltip = FALSE)

Example 2

# Improve line chart to draw out insights
rating_data <- rating_data %>% 
              # Interested in movies released from 2000 through 2019
              filter(type == "Movie" & release_year >= 2000 & release_year < 2020) %>%
              mutate( highlight = ifelse(rating == "TV-MA", "TV-MA", "Others"))

# Get overall growth rates over a 10-year and a 20-year period
min_year <- min(rating_data$release_year)
max_year <- max(rating_data$release_year)

# Number of TV-MA (mature) movies in the minimum year, 2010, and the maximum year
rate_range <- rating_data %>% 
              filter(release_year == min_year | release_year == max_year | release_year == 2010) %>%
              filter(highlight == 'TV-MA') %>%
              arrange(release_year) # row order: min_year, 2010, max_year (assumes one TV-MA row per year)

# Compute 20-year and 10-year growth rates (in percent)
growth_20 <- (rate_range$frequency[3]/rate_range$frequency[1] - 1) * 100

growth_10 <- round((rate_range$frequency[3]/rate_range$frequency[2] - 1) * 100,0)


# Revised line plot
rating_plot_2 <- rating_data %>%
    ggplot( aes(x = release_year, y = frequency, group = rating, color = highlight)) +
    geom_line(linewidth = 1.5) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
    # Name the colors so each one is tied to its level explicitly
    scale_color_manual(values = c("Others" = "#D6DBDF", "TV-MA" = "#f9007a")) +
    xlab("Release year") + ylab("Number of Movies") +
    ggtitle('Increase In Mature Content Over The Last Decade') +
    theme_minimal() +
    # annotate() draws the label once; geom_label() with fixed x/y repeats it for every row
    annotate("label", x = 2013.5, y = 300,
             label = glue::glue("Movies for Mature Audiences\nincreased {growth_10}% over the last decade"),
             size = 4, color = "#34495E") +
    theme(
        legend.position = "none",
        plot.title = element_text(size = 18,face = "bold"),
        axis.title = element_text(size = 14)
    )

rating_plot_2 #%>% ggplotly(width = 800, height = 500)
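
The growth rates above are percentage changes of the form (end / start - 1) × 100. As a rough sketch of the same arithmetic, the helper below wraps that formula and looks counts up by year rather than by row position. The names `pct_growth` and `freq_in` and the toy counts are hypothetical, introduced only for illustration; they are not the values from the Netflix data.

```r
# Percentage growth between a starting count and an ending count
pct_growth <- function(start, end) {
  (end / start - 1) * 100
}

# Hypothetical yearly counts, for illustration only (not the Netflix data)
toy <- data.frame(
  release_year = c(2000, 2010, 2019),
  frequency    = c(20, 50, 125)
)

# Look up the count for a given year instead of relying on row order
freq_in <- function(df, year) {
  df$frequency[df$release_year == year]
}

growth_10_toy <- round(pct_growth(freq_in(toy, 2010), freq_in(toy, 2019)), 0)
growth_20_toy <- round(pct_growth(freq_in(toy, 2000), freq_in(toy, 2019)), 0)

growth_10_toy  # 150
growth_20_toy  # 525
```

Looking values up by year is a little more verbose than indexing rows, but it keeps working if the rows are reordered or a year is missing from the filtered data.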

Visit this GitHub repo for the code.